feat(proxy): durable http bridge ownership #250

Draft
aaiyer wants to merge 34 commits into Soju06:main from aaiyer:feature/durable-http-bridge-ownership

Conversation

Collaborator

@aaiyer aaiyer commented Mar 22, 2026

Why

HTTP bridge turn-state continuity currently depends on an in-memory alias map. When a proxy process restarts, evicts a session, or a replayed request lands on another replica, the proxy can only observe that the local alias is missing. That collapses distinct situations into the same failure shape and can report the wrong error even when the original bridge is already gone.

This feature makes HTTP bridge ownership durable across restarts and replica boundaries so the proxy can tell the difference between:

  • a true live-owner mismatch on another healthy replica
  • a stale or expired bridge session that can be recovered
  • an invalid turn-state token
  • a replay that must fail closed because previous-response continuity is required

Without that distinction, valid replays can fail after eviction or restart, while real ownership conflicts are indistinguishable from missing local state.
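As a rough illustration of why a durable lease helps here, the sketch below classifies a lease record from the local replica's point of view. The `Lease` fields, the `lease_state` helper, and the returned labels are hypothetical names chosen for this example; they are not the actual `http_bridge_leases` schema or the code in this branch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    """Hypothetical shape of a durable lease row (not the real schema)."""

    session_key: str
    owner_instance: str
    expires_at: float  # Unix timestamp in seconds


def lease_state(lease: Optional[Lease], local_instance: str, now: float) -> str:
    """Classify a lease as seen by the local replica."""
    if lease is None:
        return "missing"  # no durable record exists at all
    if lease.expires_at <= now:
        return "stale"  # the owner stopped refreshing; a candidate for recovery
    if lease.owner_instance != local_instance:
        return "live_elsewhere"  # another replica still holds an unexpired lease
    return "live_local"
```

With only an in-memory alias map, the first three cases all look like "alias not found"; a persisted owner and expiry are what make them separable.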

What Changed

  • Added signed, versioned HTTP bridge turn-state tokens.
  • Added durable http_bridge_leases storage plus repository wiring for live-owner tracking.
  • Recovered stale or expired bridge sessions when continuity is optional.
  • Kept fail-closed behavior for true live-owner conflicts and invalid continuity tokens.
  • Hardened reconnect and lease-refresh behavior so stale lease rows do not block valid replays.
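
For readers unfamiliar with signed, versioned tokens, here is a minimal sketch of the general technique using HMAC-SHA256 from the standard library. The `v1.<payload>.<signature>` layout, the function names, and the payload fields are assumptions made for illustration; the token format actually used by this PR may differ.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

TOKEN_VERSION = "v1"  # hypothetical version tag


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _unb64(text: str) -> bytes:
    # Restore the padding that _b64 stripped.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(version: str, body: str, secret: bytes) -> str:
    signing_input = f"{version}.{body}".encode("utf-8")
    return _b64(hmac.new(secret, signing_input, hashlib.sha256).digest())


def sign_turn_state(payload: dict, secret: bytes) -> str:
    """Serialize the payload and append a signature over version + body."""
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    body = _b64(raw.encode("utf-8"))
    return f"{TOKEN_VERSION}.{body}.{_signature(TOKEN_VERSION, body, secret)}"


def verify_turn_state(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the token is well-formed and authentic, else None."""
    parts = token.split(".")
    if len(parts) != 3 or parts[0] != TOKEN_VERSION:
        return None
    version, body, sig = parts
    expected = _signature(version, body, secret)
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("ascii")):
        return None
    try:
        return json.loads(_unb64(body))
    except ValueError:
        return None
```

Putting the version inside the signed input means a token cannot be relabeled as another version without invalidating the signature, and an unknown version is rejected before any payload is parsed.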

Validation

  • uv run ruff check app/modules/proxy/api.py app/modules/proxy/service.py tests/integration/test_http_responses_bridge.py tests/unit/test_pricing.py tests/unit/test_proxy_utils.py
  • uv run ruff format --check app/modules/proxy/api.py app/modules/proxy/service.py tests/integration/test_http_responses_bridge.py tests/unit/test_pricing.py tests/unit/test_proxy_utils.py
  • .venv/bin/python -m pytest -q tests/unit/test_pricing.py::test_calculate_cost_from_usage_legacy_gpt_5_tiers_preserve_priority_and_flex_rates tests/unit/test_proxy_utils.py::test_has_native_codex_transport_headers_requires_allowlisted_originator tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_signed_turn_state_recovery_preserves_stable_affinity tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_touch_closed_session_does_not_recreate_lease tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_touch_false_after_close_does_not_recreate_lease tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_persists_lease_with_actual_idle_ttl tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_precreated_retry_ignores_lease_refresh_failure -x

@aaiyer aaiyer marked this pull request as draft March 22, 2026 19:32
@aaiyer aaiyer marked this pull request as ready for review March 22, 2026 19:47
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 22, 2026
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 22, 2026
@aaiyer aaiyer marked this pull request as draft March 24, 2026 00:43
@aaiyer aaiyer force-pushed the feature/durable-http-bridge-ownership branch from 2e98ed6 to 8fb2e2c Compare March 25, 2026 18:03
@aaiyer aaiyer marked this pull request as ready for review March 25, 2026 18:03
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 25, 2026
Collaborator Author

aaiyer commented Mar 25, 2026

@codex review

Collaborator Author

aaiyer commented Mar 25, 2026

@Soju06 this change is important because the HTTP responses bridge was previously only “locally correct” inside a single worker’s memory, but not durable across the cases that matter in production: worker restarts, PID reuse, replica routing, reconnect handoff, and stale signed x-codex-turn-state replays. The goal of this branch is to make bridge ownership and continuity explicit and durable, so we either recover correctly onto the right session/replica or fail closed with the right error instead of drifting into broken continuity.

The painful part was that this behavior sits at the intersection of several different state machines:

  1. the in-memory bridge session map
  2. the durable lease row in the DB
  3. replica/worker ownership rules
  4. reconnect and reader teardown timing
  5. turn-state aliasing and replay semantics

Every time one path was fixed, an adjacent path exposed an assumption that had previously been “accidentally working.” For example, recovering a stale signed turn-state, preserving reconnect lease handoff, and keeping creation-time lease persistence overridable all sound like separate problems, but they were coupled through the same bridge lifecycle. That is why the review cycle felt endless: most follow-up fixes were not random bugs; they were hidden invariants surfacing one by one.

In practice, the hardest parts were:

  • making lease ownership durable without breaking existing reconnect behavior
  • distinguishing valid local alias replay from stale-token replay that must now expire
  • preserving test hooks and failure contracts while tightening the actual semantics
  • getting the bridge logic to satisfy both the behavioral integration tests and the stricter static typing/CI checks

So yes, this was a pain, but it was the necessary kind of pain. The branch is not just adding edge-case handling; it is forcing the bridge to have a consistent contract around ownership, recovery, and continuity loss. Without that, the failures stay nondeterministic and show up as “sometimes wrong instance,” “sometimes previous_response_not_found,” or “sometimes reconnect lost the bridge,” which is much worse to operate.

If we keep iterating in this area after this PR, the next worthwhile cleanup would be to split _get_or_create_http_bridge_session() into explicit outcomes like live_alias_reuse, stale_recovery, expired_continuity, and wrong_instance, because that’s where most of the pain came from.
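
To make that cleanup idea concrete, here is one possible shape for it. The four outcome names come from the paragraph above; the `BridgeOutcome` enum, the `classify_bridge_request` helper, its boolean inputs, and the order of the checks are all assumptions for illustration, not the current behavior of `_get_or_create_http_bridge_session()`. Invalid turn-state tokens are left out and would need their own handling.

```python
from enum import Enum


class BridgeOutcome(Enum):
    LIVE_ALIAS_REUSE = "live_alias_reuse"
    STALE_RECOVERY = "stale_recovery"
    EXPIRED_CONTINUITY = "expired_continuity"
    WRONG_INSTANCE = "wrong_instance"


def classify_bridge_request(
    *,
    has_local_alias: bool,
    live_owner_elsewhere: bool,
    continuity_required: bool,
) -> BridgeOutcome:
    """Map request facts to a single explicit outcome (hypothetical ordering)."""
    if has_local_alias:
        # The local in-memory session is still valid: reuse it.
        return BridgeOutcome.LIVE_ALIAS_REUSE
    if live_owner_elsewhere:
        # Another healthy replica holds the lease: fail closed.
        return BridgeOutcome.WRONG_INSTANCE
    if continuity_required:
        # The bridge is gone and the replay needs previous-response continuity.
        return BridgeOutcome.EXPIRED_CONTINUITY
    # The bridge is gone but continuity is optional: recover onto a new session.
    return BridgeOutcome.STALE_RECOVERY
```

Returning one enum value per request would let callers branch on a named outcome instead of inferring it from which exception or missing key they hit.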

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!


@aaiyer aaiyer marked this pull request as draft March 25, 2026 19:54